Modelling Project Success by Alfi, Jiaying, Himanshu, Sara



Problem Solving Strategy

A typical machine learning project follows a general flow of analysis stages for building a predictive model. The steps followed in this analysis are:

  1. Understanding the problem domain
  2. Data Exploration and Preparation
  3. Feature Engineering
  4. Dimensionality Reduction (or Feature Selection)
  5. Various Model Evaluation
  6. Hyper-parameter Tuning
  7. Ensembling: Model Selection

STEP 1. Understanding the problem domain

  • Kickstarter maintains a global crowdfunding platform focused on creativity (films, music, stage shows, comics, journalism, video games, technology, and food-related projects).
  • People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges.

Question: can we build a model that predicts whether a project will be successful, failed, or canceled, based on the given dataset?
List of possible predicting factors:

  • Total amount to be raised
  • Total duration of the project Campaign
  • Theme of the project
  • Writing style of the project description
  • Length of the project description
  • Project launch time
  • Backers and pledged amount (obvious predictors)
In [4]:
%reset -f
# Load prerequisites
import sys
import os
import math
import pickle
import matplotlib
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import gc

import warnings
warnings.filterwarnings('ignore')
np.set_printoptions(threshold=sys.maxsize)

## Visualization libraries

import plotly.tools as tls
import plotly.offline as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
from collections import Counter

##Text Processing

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from textblob import TextBlob
import string
import re
stop_words = set(stopwords.words('english')) 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Feature Selection/Elimination
import statsmodels.formula.api as sm 
from sklearn.feature_selection import RFE 
from sklearn.linear_model import LassoCV

#Bagging and Boosting Algorithms, Evaluation Metric
!pip install imblearn
!pip install scipy
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

#Algos
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier ##SKLearn GBM - slower
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier
import lightgbm as lgb
from sklearn.ensemble import VotingClassifier

##DR Tools
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA 
from sklearn.pipeline import make_pipeline 
from sklearn.model_selection import cross_val_score
##Hyper

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from pprint import pprint

STEP 2. Data Exploration and Preparation

  • Verified Individual Column values
  • Class Variable Distribution (selected: canceled, failed, successful)
     - failed        52.22%
     - successful    35.38%
     - canceled      10.24%
     - undefined      0.94%
     - live           0.74%
     - suspended      0.49%

Canceled State: about 10% of the projects in this dataset are in the canceled state. Since the dataset gives no reason for a project's cancellation, nor the date on which it was canceled, canceled is treated here as a separate state rather than being merged with failed.

For example, the project owner may have secured funding elsewhere, or the project requirements may have changed, prompting the owner to recreate the crowdfunding campaign.

In [6]:
print ("Total Projects: ", df_ks.shape[0], "\nTotal Features: ", df_ks.shape[1])
df_ks.head()
Total Projects:  378661 
Total Features:  15
Out[6]:
ID name category main_category currency deadline goal launched pledged state backers country usd pledged usd_pledged_real usd_goal_real
0 1000002330 The Songs of Adelaide & Abullah Poetry Publishing GBP 2015-10-09 1000.0 2015-08-11 12:12:28 0.0 failed 0 GB 0.0 0.0 1533.95
1 1000003930 Greeting From Earth: ZGAC Arts Capsule For ET Narrative Film Film & Video USD 2017-11-01 30000.0 2017-09-02 04:43:57 2421.0 failed 15 US 100.0 2421.0 30000.00
2 1000004038 Where is Hank? Narrative Film Film & Video USD 2013-02-26 45000.0 2013-01-12 00:20:50 220.0 failed 3 US 220.0 220.0 45000.00
3 1000007540 ToshiCapital Rekordz Needs Help to Complete Album Music Music USD 2012-04-16 5000.0 2012-03-17 03:24:11 1.0 failed 1 US 1.0 1.0 5000.00
4 1000011046 Community Film Project: The Art of Neighborhoo... Film & Video Film & Video USD 2015-08-29 19500.0 2015-07-04 08:35:03 1283.0 canceled 14 US 1283.0 1283.0 19500.00

Data Cleaning and Noise Removal

  1. Get rid of unwanted columns (ID, goal, pledged, usd_pledged and currency )
  2. Remove duplicates if they exist
  3. Handle missing values; in this case, delete those rows
  4. Get rid of noise above a 2,200,000 goal amount (all such projects failed)
  5. Projects launched during 1970 and 2018 (6 rows) can be removed
  6. Misrepresented data such as "N,0"" in the country column must be addressed; it is cleaned as part of data cleaning.

Note: the name column has 4 NaN values, whereas usd pledged has 3,797. These rows can be removed directly, as the dataset is big enough to perform the analysis.

  • Before Cleaning: (378661, 15)
  • After Cleaning: (369678, 10)
In [47]:
def data_clean(df_ks):
    df_ks = df_ks.dropna() ## Drop the rows where at least one element is missing.
    df_ks = df_ks[df_ks["state"].isin(["failed", "successful", 'canceled'])] ## State - Successful and Failed
    df_ks = df_ks.drop(["ID", "currency", "pledged", "usd pledged", "goal"], axis = 1) ##Drop not useful columns
    df_ks = df_ks[df_ks['usd_goal_real']< 2200000] # Remove noise from the data
    return df_ks

print("Before Cleaning:", df_ks.shape)
df_clean = data_clean(df_ks)
print("After Cleaning:", df_clean.shape)
del df_ks ##  releasing system memory
gc.collect()
Before Cleaning: (378661, 15)
After Cleaning: (369678, 10)
Out[47]:
542
In [38]:
df_clean['Goal(USD Millions)'] = (df_clean['usd_goal_real'].astype(float)/1000000).astype(float)
df_clean['Pledged(USD Millions)'] = (df_clean['usd_pledged_real'].astype(float)/1000000).astype(float)

plt.figure(figsize=(12,6))
plt.suptitle('(Exploration) Goal vs Pledged Amount', fontsize=24)
#plt.annotate('After approximate 2000000 goal, none of them were successfull(Noise)', xy=(650000, 960000), xytext=(600000, 840000),arrowprops=dict(facecolor='black', shrink=0.05))
sns.set_style('whitegrid')
sns.set(font_scale=1.4)
ax = sns.scatterplot(x="Goal(USD Millions)", y="Pledged(USD Millions)", s=130, hue='state' , data=df_clean)

plt.show()
df_clean = df_clean.drop(["Goal(USD Millions)", "Pledged(USD Millions)"], axis = 1)

Distributions - Outliers and Skew

Numeric variables such as backers, usd_pledged_real, and usd_goal_real are highly right-skewed because many failed instances have no backers or pledged amount at all. This will be addressed through data normalization while developing a model.

To explore these columns, the data needs to be transformed; histograms of the transformed values then visualize the distributions.

| skew | usd_goal_real: 12.765938 | usd_pledged_real: 82.063085 | backers: 86.294188 |

Column    usd_goal_real_log    usd_pledged_real_log
count         369678.000000           369678.000000
mean               8.632460                5.775453
std                1.671539                3.309677
min                0.009950                0.000000
25%                7.601402                3.526361
50%                8.612685                6.456770
75%                9.662097                8.314587
max               14.591996               16.828050

Minimum goal amount is as small as 0.01

In [51]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

#General Stats
df_clean["usd_goal_real_log"] = np.log(df_clean.usd_goal_real+1)
df_clean["usd_pledged_real_log"] = np.log(df_clean.usd_pledged_real+1)
#df_clean["backers_log"] = np.log(df_clean.backers+1)
# drop by Name
df1 = df_clean.drop(['usd_goal_real', 'usd_pledged_real', 'backers'], axis=1)
#print (df1.describe())
del df1
df_clean.drop(['usd_goal_real_log', 'usd_pledged_real_log'], axis=1, inplace = True)
gc.collect()

#print("Minimum goal amount is as small as 0.01")

#configure_plotly_browser_state()
df_cancel = df_clean[df_clean["state"] == "canceled"]
df_failed = df_clean[df_clean["state"] == "failed"]
df_sucess = df_clean[df_clean["state"] == "successful"]


#First plot
trace0 = go.Histogram(
    x= np.log(df_clean.usd_goal_real+1),
    histnorm='probability', showlegend=False,
    xbins=dict(
        start=-5.0,
        end=19.0,
        size=1),
    autobiny=True)

#Second plot
trace1 = go.Histogram(
    x = np.log(df_clean.usd_pledged_real+1),
    histnorm='probability', showlegend=False,
    xbins=dict(
        start=-1.0,
        end=17.0,
        size=1))

# Add histogram data
x1 = np.log(df_failed['usd_goal_real']+1)
x2 = np.log(df_sucess["usd_goal_real"]+1)
x3 = np.log(df_cancel["usd_goal_real"]+1)

trace3 = go.Histogram(
    x=x1,
    opacity=0.60, nbinsx=30, name='Goals Failed', histnorm='probability'
)
trace4 = go.Histogram(
    x=x2,
    opacity=0.60, nbinsx=30, name='Goals Sucessful', histnorm='probability'
)
trace5 = go.Histogram(
    x=x3,
    opacity=0.60, nbinsx=30, name='Goals Cancelled', histnorm='probability'
)


data = [trace0, trace1, trace3, trace4, trace5]
layout = go.Layout(barmode='overlay')

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[ [{'colspan': 2}, None], [{}, {}]],
                          subplot_titles=('Failed, Cancelled and Sucessful Projects',
                                          'Goal','Pledged'))

#setting the figs
fig.append_trace(trace0, 2, 1)
fig.append_trace(trace1, 2, 2)
fig.append_trace(trace3, 1, 1)
fig.append_trace(trace4, 1, 1)
fig.append_trace(trace5, 1, 1)

fig['layout'].update(title="(Data Exploration) Log Transformed Distribuitions",
                     height=500, width=900, barmode='overlay')
iplot(fig)
This is the format of your plot grid:
[ (1,1) x1,y1           -      ]
[ (2,1) x2,y2 ]  [ (2,2) x3,y3 ]

Distributions of Monetary Columns against Class Variable - State

Amount values in the dataset are highly right-skewed, so they must be log-transformed to view their distributions.

Logarithm: taking the log of a variable is a common transformation for changing the shape of its distribution, generally used to reduce right skewness. However, it cannot be applied to zero or negative values, which is why 1 is added before the transform.

Distribution shows:

  • Successful projects had relatively small fundraising goals compared with failed or canceled projects.
  • Canceled and failed projects have higher goal amounts beyond the median.
  • About 16% of pledged amounts are around 1 USD.
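The effect of the log transform described above can be illustrated with a minimal sketch on synthetic data (not the Kickstarter set): a heavily right-skewed sample has large positive skewness, which drops close to zero after applying np.log1p, i.e. log(x + 1), the same +1 shift used in the notebook to keep zero-valued rows defined.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Heavily right-skewed amounts, like pledged totals: many small, a few huge
amounts = pd.Series(rng.lognormal(mean=8, sigma=2, size=10_000))

raw_skew = amounts.skew()            # large positive value
log_skew = np.log1p(amounts).skew()  # near zero after the transform

print(f"raw skew: {raw_skew:.2f}")
print(f"log skew: {log_skew:.2f}")
```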

STEP 3. Feature Engineering

  1. Time Data: Launched Year, Launched Month, Launch Day, is_weekend, duration

  2. Categorical Data: Create dummies for main_category and country. Categorical levels: main_category (15) and category (159) represent two levels of category granularity.

  3. Backers - Number of people supporting the project.

  4. Numerical Data: Generate the number of projects and the mean goal amount for each main category and sub-category, plus the difference between those mean goals and the project's goal amount. Goal is the total funding needed to execute the project, and pledged is the amount raised so far; usd_pledged_real and usd_goal_real are USD conversions from the original currencies using an online conversion API.

  5. Text Features: name is the project name, and various text features can be extracted from it using feature-extraction techniques.

Extract features from the project name column: length, percentage of punctuation, syllable count, character count, number of words, stopword count, capitalized-word count, and count of numeric tokens; then clean the text for plotting a word cloud.

Time: launched and deadline can be used to identify and extract time-related features.

  • Clean Data Shape: (369670, 11)
  • Added Text Features Shape: (369670, 20)
  • Added Numerical Features Shape: (369670, 66)
In [43]:
def syllable_count(word):
    word = word.lower()
    vowels = "aeiouy"
    count = 0
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith("e"):
        count -= 1
    if count == 0:
        count += 1
    return count

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

def avg_word(sentence):
  words = sentence.split()
  return (sum(len(word) for word in words)/len(words))

def _clean(txt): #test['name'] = df_ks['name'].apply(_clean)
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    txt = txt.lower()
    # punctuation removal 
    txt = ''.join(x for x in txt if x not in string.punctuation)
    txt = re.sub('[%s]' % re.escape(string.punctuation), ' ', txt)
    txt = re.sub('[‘’“”…]', ' ', txt)
    txt = re.sub('\n', ' ', txt)
    txt = re.sub('\w*\d\w*', ' ', txt)

    # stopwords removal  
    word_tokens = word_tokenize(txt)    
    #text_list = [w for w in word_tokens if not w in stop_words]  
    clean_txt = ""
  
    for w in word_tokens:
        if w.lower() not in stop_words:
            clean_txt += " "
            clean_txt += w 
    
    clean_txt = ' '.join(clean_txt.split()) # Removing multiple whitespaces
    noise = ['canceled']
    for ns in noise:
        clean_txt = clean_txt.replace(ns, "")

    return clean_txt

## feature engineering

def features1(projects):
    projects["launched_year"]   = projects["launched"].dt.year
    projects["launched_month"]   = projects["launched"].dt.month
    projects["launched_week"]    = projects["launched"].dt.week
    projects["launched_day"]     = projects["launched"].dt.weekday
    projects["is_weekend"]       = projects["launched_day"].apply(lambda x: 1 if x > 4 else 0)
    #projects["state"]            = projects["state"].apply(lambda x: 1 if x=="successful" else 0)
    projects["duration"]         = projects["deadline"] - projects["launched"]
    projects["duration"]         = projects["duration"].apply(lambda x: int(str(x).split()[0]))
    projects = pd.get_dummies(projects['country']).join(projects)
    projects = pd.get_dummies(projects['main_category']).join(projects)  
    ## label encoding the categorical features
    #projects = pd.concat([projects, pd.get_dummies(projects["main_category"])], axis = 1)
    le = LabelEncoder()
    for c in ["category", "main_category"]:
        projects[c] = le.fit_transform(projects[c])

    ## Generate Count Features related to Category and Main Category
    t2 = projects.groupby("main_category").agg({"usd_goal_real" : "mean", "category" : "sum"}) # Mean and count
    t1 = projects.groupby("category").agg({"usd_goal_real" : "mean", "main_category" : "sum"})
    t2 = t2.reset_index().rename(columns={"usd_goal_real" : "mean_main_category_goal", "category" : "main_category_count"})
    t1 = t1.reset_index().rename(columns={"usd_goal_real" : "mean_category_goal", "main_category" : "category_count"})
    projects = projects.merge(t1, on = "category")
    projects = projects.merge(t2, on = "main_category")
    projects["diff_mean_category_goal"] = projects["mean_category_goal"] - projects["usd_goal_real"]
    projects["diff_mean_category_goal"] = projects["mean_main_category_goal"] - projects["usd_goal_real"] # NOTE: overwrites the line above; rename to diff_mean_main_category_goal to keep both features
    projects["diff_pledged_goal_real"] = projects["usd_pledged_real"] - projects["usd_goal_real"]
    projects = projects.drop(["launched", "deadline"], axis = 1)
    return projects

def text_feat(df):
    # Function to calculate length of message excluding space
    df['name_len'] = df['name'].apply(lambda x: len(x) - x.count(" "))
    df['punct%'] = df['name'].apply(lambda x: count_punct(x))
    df["syllable_count"]   = df["name"].apply(lambda x: syllable_count(x))
    df["num_words"]  = df["name"].apply(lambda x: len(x.split()))
    df["num_chars"]  = df["name"].apply(lambda x: len(x.replace(" ","")))
    df['avg_word'] = df['name'].apply(lambda x: avg_word(x))
    df['num_stopwords'] = df['name'].apply(lambda x: len([x for x in x.split() if x in stop_words]))
    df['num_numerics'] = df['name'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
    df['num_capitalized'] = df['name'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
    df['name'] = df['name'].apply(_clean)
    
    return df


print("Clean Data Shape:", df_clean.shape)
df_text_feat = text_feat(df_clean)
#df_text_feat_tfidf = name_tfidf(df_text_feat)
print("Added Text Features Shape:",df_text_feat.shape)
df_feat = features1(df_text_feat)
print("Added Numerical Features Shape:", df_feat.shape)
#df_feat = category_tfidf(df_text_feat)
#print("Added Category TF-IDF Shape:", df_feat.shape)
Clean Data Shape: (369670, 11)
Added Text Features Shape: (369670, 20)
Added Numerical Features Shape: (369670, 66)
In [47]:
from wordcloud import WordCloud, STOPWORDS

# Thanks : https://www.kaggle.com/aashita/word-clouds-of-various-shapes ##
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'school', 'miami', 'canceled'}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color='black',
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=800, 
                    height=400,
                    mask = mask)
    wordcloud.generate(str(text))
    
    #plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  
    

plt.figure(figsize=(16,10))
#plt.suptitle('Bottom Performing Universities and Colleges (Some Campaign not ended)', fontsize=24)

plt.subplot(2,2,1)
plot_wordcloud(df_text_feat["name"], title="Project Name")

plt.subplot(2,2,2)
plot_wordcloud(df_clean["category"], title="Sub-category")

STEP 4. Dimensionality Reduction (or Feature Selection)

1. Low Variance Filter
2. High Correlation filter
3. Backward Elimination
4. Recursive Feature Elimination

The variance, correlation, and p-value filters did not reduce the feature set much and were not helpful; LDA did not help either. Recursive Feature Elimination with a RandomForest classifier yields an optimal set of features for training and testing the predictive model.

  • Optimum number of features: 12
  • Score with 12 features: 0.926990
  • Selected Features: Index(['backers', 'usd_pledged_real', 'usd_goal_real', 'name_len', 'punct%','syllable_count', 'num_chars', 'avg_word', 'launched_year','launched_week', 'duration', 'diff_mean_category_goal'], dtype='object')
In [4]:
#Dataframe
df_feat = pd.read_pickle('df_features.pkl')
df_test= df_feat.head(10000)

y = df_test.state # setting output variable 
features = [c for c in df_test.columns if c not in ["state", "name", "diff_pledged_goal_real", 'country']]
X = df_test[features] # choosing initial features
print("Before Balancing Shape X:", X.shape, "y: ", y.shape)
ad = ADASYN()
X_ad, y_ad = ad.fit_sample(X, y)
print("After Balancing Shape X:", X_ad.shape, "y: ", y_ad.shape)
X_train, X_test, y_train, y_test = train_test_split(X_ad,y_ad, test_size = 0.25, random_state = 0)
#Normalizing the features 
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print("Normalize")
#no of features
nof_list=np.arange(1, X.shape[1])            
high_score=0
#Variable to store the optimum features
nof=0           
score_list =[]
for n in range(len(nof_list)):
    model = RandomForestClassifier(criterion='entropy')
    rfe = RFE(model,nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
        print(n, ": ", nof)

print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))

cols = list(X.columns)
model = RandomForestClassifier(criterion='entropy', n_jobs=3)
#Initializing RFE model
rfe = RFE(model, nof)             
#Transforming data using RFE
X_rfe = rfe.fit_transform(X,y)  
#Fitting the data to model
model.fit(X_rfe,y)              
temp = pd.Series(rfe.support_,index = cols)
selected_features_rfe = temp[temp==True].index
print("Selected Features: ", selected_features_rfe)
Before Balancing Shape X: (10000, 62) y:  (10000,)
After Balancing Shape X: (18242, 62) y:  (18242,)
Normalize
0 :  1
1 :  2
2 :  3
3 :  4
4 :  5
5 :  6
6 :  7
7 :  8
8 :  9
11 :  12
Optimum number of features: 12
Score with 12 features: 0.926990
Selected Features:  Index(['backers', 'usd_pledged_real', 'usd_goal_real', 'name_len', 'punct%',
       'syllable_count', 'num_chars', 'avg_word', 'launched_year',
       'launched_week', 'duration', 'diff_mean_category_goal'],
      dtype='object')

STEP 5. Various Model Evaluation

Modelling Classification:

  • Rebalance the class variable using the ADASYN oversampling technique.
  • Save the balanced set of selected feature values for later use, so that re-execution of all the steps above is not necessary.
  • Apply various models with default settings and check the accuracy/misclassification rate.
  • Predict on both the training and test sets to evaluate whether the model performs well on what it learned and whether it generalizes to unseen (test) data.

Execute Various Classifier Algorithms and Note Accuracy

  1. Model with Default Parameters
  2. Tuned Model
  • Before Balancing Shape X: (369678, 12) y: (369678,)
  • After Balancing Shape X: (584054, 12) y: (584054,)
In [26]:
## define predictors and label 
#Dataframe
df_feat = pd.read_pickle('df_features.pkl')
labelencoder_X = LabelEncoder() 
df_feat['state'] = labelencoder_X.fit_transform(df_feat['state'])
#df_test= df_feat.head(10000)
features = [c for c in df_feat.columns if c in ['backers', 'usd_pledged_real', 'usd_goal_real', 'name_len', 'punct%',
       'syllable_count', 'num_chars', 'avg_word', 'launched_year',
       'launched_week', 'duration', 'diff_mean_category_goal']]
            
'''            ['category', 'backers', 'usd_pledged_real', 'usd_goal_real', 'name_len',
       'punct%', 'syllable_count', 'num_chars', 'avg_word', 'launched_year',
       'launched_month', 'launched_week', 'duration', 'mean_category_goal',
       'category_count', 'mean_main_category_goal', 'diff_mean_category_goal']'''
X = df_feat[features]
y = df_feat.state
print("Before Balancing Shape X:", X.shape, "y: ", y.shape)
ad = ADASYN()
X, y = ad.fit_sample(X, y)
print("After Balancing Shape X:", X.shape, "y: ", y.shape)
#[c for c in df_feat.columns if c in ["usd_pledged_real","usd_goal_real","diff_mean_category_goal"]]
#[c for c in df_feat.columns if c not in ["state", "name", "backers","usd_pledged_real","diff_pledged_goal_real", 'country']]

#Dataframe
#data = pd.read_pickle('dtm.pkl')
# Let's pickle it for later use
#X.to_pickle("X_without_pledged_backers.pkl")
#y.to_pickle("y_without_pledged_backers.pkl")

with open('X_with_pledged_backers_12.pkl','wb') as f:
    pickle.dump(X, f)
    f.close()
with open('y_with_pledged_backers_12.pkl','wb') as f:
    pickle.dump(y, f)
    f.close()
Before Balancing Shape X: (369678, 12) y:  (369678,)
After Balancing Shape X: (584054, 12) y:  (584054,)
In [20]:
#Dataframe
#X = pd.read_pickle('X_without_pledged_backers.pkl')
#y = pd.read_pickle('y_without_pledged_backers.pkl')
#RFE - X_without_pledged_backers
#y_without_pledged_backers
with open('X_with_pledged_backers_12_10k.pkl','rb') as f:
    X = pickle.load(f)
    print(X.shape)
    f.close()
with open('y_with_pledged_backers_12_10k.pkl','rb') as f:
    y = pickle.load(f)
    print(y.shape)
    f.close()

#Splitting the data into Training Set and Test Set
## prepare training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 2, stratify=y)

#Normalizing the features 
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

#10k 18184,12
(18184, 12)
(18184,)
In [19]:
from sklearn.metrics import accuracy_score
## Generic Execution Function
def model_exec(model, name):
    print('Running Model')
    model.fit(X_train,y_train)

    #Making predictions on the Train and Test Set
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)

    #Evaluating the predictions using a Confusion Matrix
    print("Training Set Accuracy:")
    print(accuracy_score(y_train_pred, y_train))
    print("Test Set Accuracy:\n", accuracy_score(y_pred, y_test))

#    print(confusion_matrix(y_train, y_train_pred))
    df_cmtr = pd.DataFrame(confusion_matrix(y_train, y_train_pred), index = ['cancelled', 'failed', 'successful'],
                  columns = ['cancelled(p)', 'failed(p)', 'successful(p)'])
    df_cm = pd.DataFrame(confusion_matrix(y_test, y_pred), index = ['cancelled', 'failed', 'successful'],
                  columns = ['cancelled(p)', 'failed(p)', 'successful(p)'])
    plt.figure(figsize=(14,10))
    s_title = name + ' Confusion Matrix'
    plt.suptitle(s_title, fontsize=16)

    plt.subplot(2,2,1)
    plt.gca().set_title('Train Data')
    sns.heatmap(df_cmtr, annot=True, cmap=plt.cm.Reds)
    plt.subplot(2,2,2)
    plt.gca().set_title('Test Data')
    sns.heatmap(df_cm, annot=True, cmap=plt.cm.Reds)
    plt.show()

    # save the model to disk
    filename = 'Models/'+name+'.sav'
    pickle.dump(model, open(filename, 'wb'))
    return model

def model_exec_wholedata_final(model, name):
    print('Running Model')
    #Normalizing the features - based on traning 
    X_whole = sc_X.transform(X)
    print("Normalized Whole data:", X_whole.shape)
    model.fit(X_whole,y)
    print("Whole data Fit:\n%r" % model)
    #Making predictions on the Train and Test Set
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)

    #Evaluating the predictions using a Confusion Matrix
    print("Training Set Accuracy:")
    print(accuracy_score(y_train_pred, y_train))
    print("Test Set Accuracy:\n", accuracy_score(y_pred, y_test))

    #print(confusion_matrix(y_train, y_train_pred))
    df_cmtr = pd.DataFrame(confusion_matrix(y_train, y_train_pred), index = ['cancelled', 'failed', 'successful'],
                  columns = ['cancelled(p)', 'failed(p)', 'successful(p)'])
    df_cm = pd.DataFrame(confusion_matrix(y_test, y_pred), index = ['cancelled', 'failed', 'successful'],
                  columns = ['cancelled(p)', 'failed(p)', 'successful(p)'])
    plt.figure(figsize=(14,10))
    s_title = name + ' Confusion Matrix'
    plt.suptitle(s_title, fontsize=16)

    plt.subplot(2,2,1)
    plt.gca().set_title('Train Data')
    sns.heatmap(df_cmtr, annot=True, cmap=plt.cm.Reds)
    plt.subplot(2,2,2)
    plt.gca().set_title('Test Data')
    sns.heatmap(df_cm, annot=True, cmap=plt.cm.Reds)
    plt.show()

    # save the model to disk
    filename = 'Models/'+name+'.sav'
    pickle.dump(model, open(filename, 'wb'))
    return model

def feature_imp(model, plt_title, plot):
    # Feature Importance graph
    features = ['backers', 'usd_pledged_real', 'usd_goal_real', 'name_len', 'punct%',
           'syllable_count', 'num_chars', 'avg_word', 'launched_year',
           'launched_week', 'duration', 'diff_mean_category_goal']
    importances = model.feature_importances_
    indices = np.argsort(importances)
    plt.figure(figsize=(16,10))
    plot.title(plt_title)
    plot.barh(range(len(indices)), importances[indices], color='b', align='center')
    plot.yticks(range(len(indices)), [features[i] for i in indices])
    plot.xlabel('Relative Importance')
    plot.show()

RandomForestClassifier

In [27]:
#Fitting Classifier to Training Set. Create a classifier object here and call it classifierObj 
RForest = RandomForestClassifier(criterion='entropy') 
RForest = model_exec(RForest, 'RForest')
#10k - Train 0.9972, Test 0.9472
Running Model
Training Set Accuracy:
0.9970914492030913
Test Set Accuracy:
 0.9241081747438169
In [33]:
#Fitting Classifier to Training Set. Create a classifier object here and call it classifierObj 
TunedRForest = RandomForestClassifier(criterion='entropy', max_depth= 60, n_estimators= 1000, bootstrap=False, random_state=2)
TunedRForest = model_exec(TunedRForest, 'TunedRForest')
Running Model
Training Set Accuracy:
1.0
Test Set Accuracy:
 0.9450096233159198
In [34]:
# Training predictions (to demonstrate overfitting)
train_rf_probs = TunedRForest.predict_proba(X_train)[:, 1]

# Testing predictions (to determine performance)
rf_probs = TunedRForest.predict_proba(X_test)[:, 1]

n_nodes = []
max_depths = []

# Stats about the trees in random forest
for ind_tree in TunedRForest.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
    
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')
Average number of nodes 3192
Average maximum depth 29
In [35]:
feature_imp(TunedRForest, "TunedRForest", plt)

STEP 6. Hyper-parameter Tuning using RandomizedSearchCV

  • Grid Search evaluates all possible combinations of the specified hyperparameter values and returns the best combination.
  • RandomizedSearchCV randomly samples combinations of parameters to identify the best set.

Example - RandomForestClassifier

  1. Parameters that improve the predictive power of the model:

    • max_features - the number of features Random Forest is allowed to try in an individual tree (increases performance but is computationally expensive)
    • n_estimators - more trees increase performance but are computationally expensive
  2. Parameters that make model training easier:
    • n_jobs: -1 uses all CPUs
    • random_state: makes results easy to replicate; a fixed random_state always produces the same results given the same parameters and training data.
In [27]:
# pprint and RandomizedSearchCV are required below (they may also be imported earlier in the notebook)
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

def hyper_tuning(params, model):
    # Look at parameters used by our current forest
    print('Parameters currently in use:\n')
    pprint(model.get_params())
    print('Parameters Grid:\n')
    pprint(params)    
    # Random search of parameters, using 3 fold cross validation, 
    # search across 100 different combinations, and use all available cores
    random = RandomizedSearchCV(estimator = model, param_distributions = params, n_iter = 100, cv = 3, verbose=2, random_state=1, n_jobs = 3)
    # Fit the random search model
    random.fit(X_train, y_train)
    print('Best:\n')
    print('Score: ', random.best_score_)
    print ('Estimator: ',random.best_estimator_)
    return random


def plot_hyper(train_res, test_res, paralist, namex, namey):
    # Create a trace
    trace0 = go.Scatter(x = paralist, y = test_res, name = 'Test')
    trace1 = go.Scatter(x = paralist, y = train_res,name = 'Train')
    data = [trace0, trace1]
    # Edit the layout
    layout = dict(title = 'Score Graph', xaxis = dict(title = namex), yaxis = dict(title = namey), )
    fig = dict(data=data, layout=layout)
    py.iplot(fig, filename='styled-line')
In [23]:
#Random Forest

RFTuning = RandomForestClassifier(max_depth=60, bootstrap=False, criterion='entropy')

# Number of trees in random forest, linspace returns evenly spaced number
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 775, num = 25)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [int(x) for x in np.linspace(start = 20, stop = 80, num = 5)]
# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start = 1, stop = 80, num = 5)]
# Method of selecting samples for training each tree
bootstrap = [False]
# Create the random grid (only n_estimators is searched in this run;
# the other lists above are prepared for wider searches)
random_grid = {'n_estimators': n_estimators}

results = hyper_tuning(random_grid, RFTuning)
Parameters currently in use:

{'bootstrap': False,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': 60,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
Parameters Grid:

{'n_estimators': [10,
                  41,
                  73,
                  105,
                  137,
                  169,
                  201,
                  233,
                  265,
                  296,
                  328,
                  360,
                  392,
                  424,
                  456,
                  488,
                  520,
                  551,
                  583,
                  615,
                  647,
                  679,
                  711,
                  743,
                  775]}
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:  1.7min
[Parallel(n_jobs=3)]: Done  75 out of  75 | elapsed:  7.3min finished
Best:

Score:  0.9361380353337458
Estimator:  RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=60, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=711, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)
In [24]:
pd.DataFrame(results.cv_results_).sort_values('mean_test_score', ascending=False).head()
Out[24]:
mean_fit_time std_fit_time mean_score_time std_score_time param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score mean_train_score std_train_score
22 30.136463 0.057520 0.624862 0.022085 711 {'n_estimators': 711} 0.940825 0.937925 0.929662 0.936138 0.004729 1 1.0 1.0 1.0 1.0 0.0
7 9.664397 0.051548 0.187447 0.000011 233 {'n_estimators': 233} 0.939381 0.938544 0.930074 0.936001 0.004204 2 1.0 1.0 1.0 1.0 0.0
19 24.626379 0.231953 0.494693 0.026554 615 {'n_estimators': 615} 0.938969 0.938750 0.930074 0.935932 0.004142 3 1.0 1.0 1.0 1.0 0.0
23 30.876124 0.116481 0.667230 0.027538 743 {'n_estimators': 743} 0.938969 0.938544 0.930074 0.935863 0.004096 4 1.0 1.0 1.0 1.0 0.0
21 26.650421 0.165978 0.608592 0.057565 679 {'n_estimators': 679} 0.939588 0.937719 0.929868 0.935726 0.004211 5 1.0 1.0 1.0 1.0 0.0
In [33]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
pd.DataFrame(results.cv_results_).shape

# Create a trace
trace0 = go.Scatter(x = list(range(0,25)), y = results.cv_results_['mean_test_score'], name = 'Test')
trace1 = go.Scatter(x = list(range(0,25)), y = results.cv_results_['mean_train_score'],name = 'Train')
data = [trace0, trace1]
# Edit the layout
layout = dict(title = 'Score Graph', xaxis = dict(title = 'Para_set'), yaxis = dict(title = 'Accuracy'), )
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='styled-line')

STEP 7. KFold and Ensembling: Model Selection

KFold

K-fold cross-validation splits the training data into k equal folds; each fold serves once as the validation set while the remaining k-1 folds are used for training, giving a more reliable estimate of generalization error. A model can either suffer from underfitting (high bias) if the model is too simple, or it can overfit the training data (high variance) if the model is too complex for the underlying training data.

Ensemble

The main principle behind ensemble modelling is to group weak learners together to form one strong learner: combining the decisions from multiple models improves overall performance.

Errors in a model: variance, bias, noise

  1. Max Voting
  2. Averaging
  3. Weighted Averaging
  4. Bagging decreases the model’s variance; e.g. RandomForest (bootstrapping of training data plus aggregation)
  5. Boosting decreases the model’s bias; e.g. XGBoost (each new model is trained on the errors of previous learners)
  6. Stacking increases the predictive power of the classifier (a new model is trained on the combined predictions of two or more previous models)
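The voting- and averaging-based methods above can be sketched with scikit-learn's VotingClassifier. The base estimators and synthetic data here are illustrative assumptions, not the notebook's actual models:

```python
# Sketch of max voting vs. (weighted) averaging with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           n_classes=3, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('dt', DecisionTreeClassifier(max_depth=5)),
              ('rf', RandomForestClassifier(n_estimators=50, random_state=2))]

# Max voting: each model casts one vote; the majority class wins.
hard = VotingClassifier(estimators, voting='hard').fit(X_tr, y_tr)
# Averaging: predicted class probabilities are averaged before taking the argmax.
soft = VotingClassifier(estimators, voting='soft').fit(X_tr, y_tr)
# Weighted averaging: probabilities are weighted by per-model trust.
weighted = VotingClassifier(estimators, voting='soft', weights=[1, 1, 2]).fit(X_tr, y_tr)

print(hard.score(X_te, y_te), soft.score(X_te, y_te), weighted.score(X_te, y_te))
```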
In [92]:
#K-Fold Cross Validation
def model_kfold(model, name):
    print(name, " KFold Evaluation:")
    modelAccuracies= cross_val_score(estimator=model, X=X_train, y=y_train, cv=10, verbose=2, n_jobs=3)
    print(name, " Accuracies:", modelAccuracies)
    print(name, " Mean Accuracy:", modelAccuracies.mean())
    print(name, " Accuracy Std:", modelAccuracies.std())
    
model_kfold(TunedDecTree, 'TunedDecTree')
model_kfold(TunedRForest, 'TunedRForest')
model_kfold(TunedBaggingC, 'TunedBaggingC')
model_kfold(Tunedmodelxgb, 'Tunedmodelxgb')
model_kfold(simplegbm_model, 'simplelgbm_model')
TunedDecTree  KFold Evaluation:
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed:   53.5s finished
TunedDecTree  Accuracies: [0.9078074  0.90930004 0.90708742 0.91035209 0.9083326  0.90778822
 0.9053297  0.90796382 0.9078409  0.90894563]
TunedDecTree  Mean Accuracy: 0.9080747822473958
TunedDecTree  Accuracy Std: 0.0012744890869089748
TunedRForest  KFold Evaluation:
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed: 13.1min finished
TunedRForest  Accuracies: [0.939188   0.94124258 0.94073333 0.94150496 0.9392923  0.94045131
 0.93913425 0.94083765 0.94046887 0.9403449 ]
TunedRForest  Mean Accuracy: 0.940319815929389
TunedRForest  Accuracy Std: 0.0008048414543944626
TunedBaggingC  KFold Evaluation:
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed:  4.9min finished
TunedBaggingC  Accuracies: [0.9258069  0.92770344 0.92668493 0.92884362 0.92649047 0.9275968
 0.92671876 0.92657828 0.92640267 0.92752529]
TunedBaggingC  Mean Accuracy: 0.9270351169198616
TunedBaggingC  Accuracy Std: 0.0008312427716826074
Tunedmodelxgb  KFold Evaluation:
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed: 94.7min finished
Tunedmodelxgb  Accuracies: [0.94062796 0.94277034 0.94196256 0.94345421 0.94252349 0.9422952
 0.94122399 0.94185618 0.94155764 0.94375176]
Tunedmodelxgb  Mean Accuracy: 0.9422023327612395
Tunedmodelxgb  Accuracy Std: 0.0009182073197103619
simplelgbm_model  KFold Evaluation:
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
simplelgbm_model  Accuracies: [0.92880975 0.92912584 0.92937169 0.93123189 0.92852753 0.92912459
 0.92845728 0.93047678 0.92824655 0.93224923]
simplelgbm_model  Mean Accuracy: 0.9295621124628214
simplelgbm_model  Accuracy Std: 0.0012590128577712595
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed:  1.6min finished

Finally Selected Model

Selected Random Forest (Tuned), as it gives a mean 10-fold accuracy of 0.9403 (std 0.0008) and predicts the class variable well.

The goal is a model with low bias and low variance.

| Model | Training Accuracy | Testing Accuracy | KFold (mean, std) | Time | Size | Complexity |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest (Tuned) | 0.9999 | 0.9403 | 0.9403, 0.0008 | 3m 34s | 474 MB | Interpretable, nonlinear |
| XGBoost (Tuned) | 0.9992 | 0.9413 | 0.9422, 0.0009 | 20m 33s | 92 MB | Interpretable |
| Stacking (ExtraTree and Bagging DecTree) | 0.9996 | 0.9423 | (individual) both 0.92, 0.00 | 2m 31s | 7 GB | Somewhat interpretable, complex |
| Stacking (RForest and XGB) | 0.9995 | 0.9410 | (individual) both 0.94, 0.00 | 35m 58s | GBs | Somewhat interpretable, complex |
| Stacking (ExtraTree and XGB) | 0.9995 | 0.9414 | (individual) 0.92, 0.94, 0.00 | 43m 20s | GBs | Somewhat interpretable, complex |
  • To avoid overfitting - K-fold cross-validation to check the per-fold train/test score distributions; train and test scores should not vary much.
  • To avoid bias - stratified sampling and ADASYN oversampling to balance the class variable.
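A minimal sketch of the stratified-sampling point, using synthetic data in place of the project's dataset (ADASYN itself lives in the separate imbalanced-learn package and is not shown here):

```python
# Stratified train/test split: the class distribution of the target is
# preserved in both partitions, so no class is under-represented by chance.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           weights=[0.6, 0.3, 0.1], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=2)

print(Counter(y))    # overall class counts
print(Counter(y_tr)) # same proportions in the training split
print(Counter(y_te)) # same proportions in the test split
```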
In [14]:
from mlens.ensemble import SuperLearner
from sklearn.metrics import accuracy_score
###FINAL DELIVERABLE - RF

load_selected_models()
ensembleRF = SuperLearner(scorer = accuracy_score, random_state=2, folds=10, n_jobs=3)

# Build the first layer
ensembleRF.add([TunedRForest])
# Attach the final meta estimator
ensembleRF.add_meta(DecTree)
ensembleRF = model_exec_wholedata_final(ensembleRF, 'FinalRFTrained')
print("Fit data:\n%r" % ensembleRF.data)
Normalized Whole data: (584054, 12)
Running Model
Training Set Accuracy:
0.9999978597860214
Test Set Accuracy:
 0.9999914391624076
Fit data:
                                   score-m  score-s    ft-m   ft-s  pt-m  pt-s
layer-1  randomforestclassifier       0.92     0.02  199.54  21.96  0.92  0.18
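As a hedged scikit-learn analogue of the mlens SuperLearner pattern used above (base learners' out-of-fold predictions feed a meta-estimator), StackingClassifier does the same thing; the models and synthetic data below are illustrative, not the notebook's tuned ones:

```python
# Stacking sketch: a RandomForest base layer whose out-of-fold predictions
# train a decision-tree meta-estimator, mirroring add()/add_meta() above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           n_classes=3, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=2))],
    final_estimator=DecisionTreeClassifier(max_depth=3),
    cv=10)  # 10-fold, mirroring folds=10 in the SuperLearner call
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))
```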

In [18]:
from mlens.ensemble import SuperLearner
from sklearn.metrics import accuracy_score
###FINAL DELIVERABLE - XGB
load_selected_models()
ensembleXGB = SuperLearner(scorer = accuracy_score, random_state=2, folds=10, n_jobs=3)

# Build the first layer
ensembleXGB.add([Tunedmodelxgb])
# Attach the final meta estimator
ensembleXGB.add_meta(DecTree)
ensembleXGB = model_exec_wholedata_final(ensembleXGB, 'FinalXGBTrained')
Running Model
Training Set Accuracy:
0.9992252425397491
Test Set Accuracy:
 0.999169598753542
Fit data:
                          score-m  score-s     ft-m    ft-s  pt-m  pt-s
layer-1  xgbclassifier       0.91     0.05  1482.13  120.05  6.97  1.81

Neurons - Keras Multiclass Classifier with Four Layers

[Drawing: four-layer neural network architecture]

In [21]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
In [22]:
# fix random seed for reproducibility
seed = 2
#np.random.seed(seed)
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    # First hidden layer (input_dim matches the 12 input features)
    model.add(Dense(12, input_dim=12, activation='relu'))
    # Second hidden layer
    model.add(Dense(8, activation='relu'))
    # Third hidden layer
    model.add(Dense(8, activation='relu'))
    # Fourth hidden layer
    model.add(Dense(8, activation='relu'))
    # Softmax output over the 3 classes
    model.add(Dense(3, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
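As a library-agnostic illustration of the same four-hidden-layer architecture (not the notebook's actual Keras pipeline), scikit-learn's MLPClassifier accepts the layer widths directly; the data here is synthetic:

```python
# Stand-in for the Keras network above: same hidden-layer widths
# (12, 8, 8, 8), ReLU activations and the adam optimizer.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

mlp = MLPClassifier(hidden_layer_sizes=(12, 8, 8, 8), activation='relu',
                    solver='adam', max_iter=500, random_state=2)
mlp.fit(X_tr, y_tr)
print(round(mlp.score(X_te, y_te), 3))
```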
In [25]:
from keras.utils.np_utils import to_categorical
# `model` (the fitted network) and the one-hot targets yt_kr / ytest_kr
# come from the training cells not shown in this excerpt
print("Train Predictions:")

scores = model.evaluate(X_train, yt_kr)
print("Score - \n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
tpredictions = model.predict_classes(X_train)
tprediction_ = np.argmax(to_categorical(tpredictions), axis = 1)

print ("Test Predictions:")
scores = model.evaluate(X_test, ytest_kr)
print("Score - \n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

predictions = model.predict_classes(X_test)
prediction_ = np.argmax(to_categorical(predictions), axis = 1)
#prediction_ = labelencoder_X.inverse_transform(prediction_)
#y_test_orig = labelencoder_X.inverse_transform(y_test)

#print(confusion_matrix(y_test, prediction_))
#    print(confusion_matrix(y_train, y_train_pred))
df_cmtr = pd.DataFrame(confusion_matrix(y_train, tprediction_), index = ['cancelled', 'failed', 'successful'],
                  columns = ['cancelled(p)', 'failed(p)', 'successful(p)'])
df_cm = pd.DataFrame(confusion_matrix(y_test, prediction_), index = ['cancelled', 'failed', 'successful'],
                  columns = ['cancelled(p)', 'failed(p)', 'successful(p)'])
plt.figure(figsize=(14,10))
s_title ='NN Confusion Matrix'
plt.suptitle(s_title, fontsize=16)

plt.subplot(2,2,1)
plt.gca().set_title('Train Data')
sns.heatmap(df_cmtr, annot=True, cmap=plt.cm.Reds)
plt.subplot(2,2,2)
plt.gca().set_title('Test Data')
sns.heatmap(df_cm, annot=True, cmap=plt.cm.Reds)
plt.show()

#for i, j in zip(prediction_ , y_test_orig):
#    print( " the nn predict {}, and the species to find is {}".format(i,j))
Train Predictions:
467243/467243 [==============================] - 5s 10us/step
Score - 
acc: 89.99%
Test Predictions:
116811/116811 [==============================] - 1s 10us/step
Score - 
acc: 89.94%

Conclusion

  • Deliverable:

    • Selected Random Forest (Tuned), as it gives a mean 10-fold accuracy of 0.9403 (std 0.0008) and predicts the class variable well.
    • Retrained the model on the whole dataset (Train + Test) with 10-fold cross-validation.
  • Improvements:

    • Feature engineering and hypothesis generation can still change the set of input columns to optimize the model (bag of words, TF-IDF, etc.)
    • Interpreting model predictions would add extra benefit: relative feature importance, permutation importance, partial feature dependencies, SHAP values
    • Bayesian Optimization can be used for hyperparameter tuning.
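Of the suggested improvements, permutation importance is the quickest to sketch; this illustration uses synthetic data rather than the project's dataset:

```python
# Permutation importance: measures how much shuffling each feature's values
# hurts the test score, which is less biased than impurity-based importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)
rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_tr, y_tr)

result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=2)
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```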